Technical Note TN2093
OpenGL Performance Optimization: The Basics

Introduction

Optimization of OpenGL code has become an increasingly important activity in the development of OpenGL-based applications. This document is targeted towards OpenGL developers who are looking to improve the performance of their applications. Developers should have a fundamental knowledge of OpenGL programming and a familiarity with OpenGL on Mac OS X to fully utilize and understand the information presented here.

Before diving into code to start performance tuning an OpenGL application, it is best to examine the fundamentals of OpenGL optimization and develop a systematic approach to enhancing the performance of OpenGL applications. One of the first things to do is to launch the application (in a window, if possible) and run the 'top' command in a terminal window. This is the starting point for almost all performance analysis, as it indicates how much CPU time an application is using. This information can also be garnered from the CHUD tool 'Shark'; however, Shark is somewhat overkill for this stage of the process. The idea here is to establish a baseline value for further tuning of OpenGL.

Following this would be running the application through the OpenGL Profiler and collecting statistics on how and where the application is spending its time with regard to OpenGL. Once these two quantities are known, the actual amount of time spent in OpenGL can be used in conjunction with the amount of CPU time in use to yield an approximate value for performance. These approximate times allow the developer to see how much CPU time the application is using and how much of that time is actually being spent in OpenGL.

The Performance Tuning Roadmap

  • Run OpenGL Application (windowed) side-by-side with 'top' (or Shark)

    This yields a baseline performance number for the amount of CPU time consumed by the application. An example 'top' screenshot is shown below with the application in question (NSGLWindow) highlighted.

  • Determine baseline CPU utilization

    Note the time in the above image. The NSGLWindow application is currently using 19.5% of the available CPU time. This is the baseline value that should be used for determining how an application is performing.

  • Run OpenGL application through OpenGL Profiler, collecting a function trace and statistics

    Set up to create a new profile, as demonstrated by the following dialog from OpenGL Profiler:

    Set up to collect a trace and statistics from the application, as demonstrated by this dialog from OpenGL Profiler:

    After pressing the 'Launch' button, the application in question will launch and Profiler will silently begin collecting the indicated data. You may see small hiccups or glitches in the application, particularly when Profiler is collecting a function trace. This is normal and will not affect your performance statistics significantly. The reason for this is that Profiler is collecting and writing out a large amount of data for the function trace listing.

  • Analyze Profiler statistics data; Look for percentage of application time spent in OpenGL and where that time in OpenGL is being spent.

    This image shows the Profiler statistics from the NSGLWindow sample. The listing is sorted by most OpenGL time consumed, in descending order. This clearly indicates that the bulk of the time spent in OpenGL occurs in the push of pixel data to the screen (CGLFlushDrawable()).

    At the bottom left-hand corner of the image is a figure labeled "Estimated % time in OpenGL: 14.25%". At this point in the analysis, that number is the one of most concern. The higher this number, the more time the application is spending in OpenGL and the more opportunity there may be to improve application performance by optimizing OpenGL.

  • Analyze Profiler function traces; Look for duplicate function calls and redundant or unnecessary state changes.

    This image shows the function trace itself. Note that this is only a partial listing of the complete function trace, showing 50 lines per page. There can be thousands of such pages, or millions if Profiler is allowed to continue collecting function traces. You can scroll through the listing with the arrows at the bottom of the page. The outside arrows take you to the first and last pages, left and right respectively, and the inside arrows scroll one page backward or forward, again left and right respectively.

    The trace can be very useful for finding duplicate function calls or redundant state changes. When looking through the traces, look for back-to-back function calls with the same or similar data. These are areas that can typically be optimized in code to remove some function call overhead. When looking for redundant state changes, some commonly seen duplicates include functions such as glTexParameter*(), glPixelStore*(), glEnable() and glDisable(). Often, these functions can be called once from a setup or state modification routine and thereafter only when the state actually needs to change. It is generally good practice to keep state changes out of rendering loops (visible in the function trace as the same sequence of state changes and drawing repeated over and over) as much as possible and to use separate routines to adjust state as necessary.

  • Determine what the maximum performance benefit would be if OpenGL overhead were reduced to zero

    In this particular instance, actual performance data can be used to determine the ideal level of performance increase for OpenGL. Using the numbers generated above, the following algebraic equations yield some interesting results:

    Total Application Time (from 'top') = 19.5%

    Total Time in OpenGL (from Profiler) = 14.25%

    On initial examination, one might think that improving OpenGL performance could improve application performance by almost 15%, thus reducing the total application time by 15%. Unfortunately, that is simply not the case. The problem lies in understanding how these two numbers relate to one another, as explained in the following section.

Understanding OpenGL Performance

Perhaps the most important aspect of OpenGL performance is how it relates to overall application performance. Using the above data from the NSGLWindow sample, 19.5% of available CPU time is being used by the application. Of this 19.5%, 14.25% is being spent in OpenGL, while the remainder is being used by the application itself. The following equation illustrates the relationship of OpenGL performance to application performance:

Total OpenGL Performance Increase = (Total CPU Time Consumed) * (Percentage of Time Spent In OpenGL)

Placing the generated data into this equation yields the following results:

Total OpenGL Performance Increase = (19.5) * (14.25%) = 2.77875%

Note: The numerical values you would actually enter in a calculator are not the same as those listed above. The real computation is:

19.5 * (0.1425) = 2.77875

as you are taking a percentage of a percentage to yield a real-world value.

With this in mind, even if OpenGL were to become a total 'no op' (taking 0 time), the application would only see a 2.78% increase in performance. So if an application were running at 60 frames per second, it would then perform as follows:

New Framerate = Previous FPS * (1 + Percent Performance Increase) = 60 fps * 1.0278 = 61.67 fps

The application has gained slightly less than 2 frames per second by reducing OpenGL's share of the CPU from 14.25% to zero. This clearly shows that an improvement in OpenGL time does not translate one-for-one into application performance, and that simply reducing the amount of time spent in OpenGL may or may not offer any noticeable benefit in application performance.
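For clarity, the same arithmetic can be restated as a few lines of C. This is purely a restatement of the equations above, using the values measured from the NSGLWindow sample; nothing new is being computed.

    #include <stdio.h>

    int main(void)
    {
        /* Values measured above: 'top' reported 19.5% CPU for the
           application, and OpenGL Profiler estimated that 14.25% of
           that time was spent in OpenGL. */
        double totalCPUTime = 19.5;    /* percent of available CPU       */
        double fractionInGL = 0.1425;  /* fraction of app time in OpenGL */
        double previousFPS  = 60.0;

        /* Percentage points of CPU a zero-cost OpenGL would give back. */
        double maxGain = totalCPUTime * fractionInGL;            /* 2.77875 */

        /* Best-case framerate if that gain translated directly into frames. */
        double newFPS = previousFPS * (1.0 + maxGain / 100.0);   /* ~61.67   */

        printf("Maximum possible gain: %.5f%% of CPU\n", maxGain);
        printf("Best-case framerate:   %.2f fps\n", newFPS);
        return 0;
    }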

Note: It is impractical to think that an OpenGL application can actually have zero CPU utilization and still do anything useful. The idea of reducing the overhead of OpenGL to zero is strictly for demonstrating the concepts behind OpenGL optimization.

The following figure offers a graphical representation of the information presented above. It shows the relationship between OpenGL performance and overall application performance as well as how to determine the real-world values for performance data.

The scenario illustrated by the graphic above is a little more likely to be found in the real world, where OpenGL is taking up about 25% of the CPU time used by the application. The actual amount of time used is absent in this illustration, as it is not entirely necessary to demonstrate the concept. The idea here is that regardless of how much actual time is used by the application, OpenGL is taking only a relatively small percentage (25%) of that time while the application is using the remainder (75%).

Use Profiler, Driver Monitor and CHUD Tools

The previous section offered some tips and instructions for using the OpenGL Profiler to collect performance data for an OpenGL application. With Profiler, developers can see how much time is being spent in OpenGL, in which functions that time is being spent and function call traces for the application being analyzed. OpenGL Profiler contains many more features and functions, not just the ones mentioned previously. For a more complete description of the OpenGL Profiler, please visit the OpenGL Profiler web page.

These three tools, included with the Mac OS X Developer Tools installation, are of paramount importance when performance tuning OpenGL applications. They are capable of tracking down and illustrating many of the common performance problems found in OpenGL applications. Instead of duplicating a great deal of information about these tools in this document, links to the appropriate documentation are included in the Reference Section below.

When you start working with Profiler, keep the following items in mind:

  • Collect statistics to see where time is being spent in OpenGL

  • Collect a function trace to look for duplicate function calls and redundant state changes

  • Look for glFinish() commands in the function statistics and remove them from the code if possible.

  • Check for vertex submission commands - determine how vertices are being submitted to OpenGL.
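As an illustration of the last point, immediate-mode submission is easy to spot in a function trace because it generates one OpenGL call per vertex every frame. The sketch below contrasts immediate mode with a vertex array submission; the triangle data is made up purely for illustration.

    #include <OpenGL/gl.h>

    /* Hypothetical vertex data, for illustration only. */
    static const GLfloat vertices[] = {
        -1.0f, -1.0f, 0.0f,
         1.0f, -1.0f, 0.0f,
         0.0f,  1.0f, 0.0f,
    };

    /* Immediate mode: glBegin(), three glVertex3f() calls and glEnd()
       appear in the trace for every triangle, every frame. */
    static void drawImmediate(void)
    {
        glBegin(GL_TRIANGLES);
        glVertex3f(vertices[0], vertices[1], vertices[2]);
        glVertex3f(vertices[3], vertices[4], vertices[5]);
        glVertex3f(vertices[6], vertices[7], vertices[8]);
        glEnd();
    }

    /* Vertex arrays: one submission call per batch, so far fewer
       entries in the trace and far less per-call overhead. */
    static void drawWithVertexArray(void)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, vertices);
        glDrawArrays(GL_TRIANGLES, 0, 3);
        glDisableClientState(GL_VERTEX_ARRAY);
    }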

The OpenGL Driver Monitor can be overwhelming at first, so to get a better grasp on the data displayed, please take a look at the OpenGL Driver Monitor Decoder Ring. This document describes in moderate detail the various aspects of Driver Monitor and some of the more important statistics that can be examined within the application. For information on Driver Monitor itself, please visit the OpenGL Driver Monitor web page.

In this image of Driver Monitor running simultaneously with an OpenGL application, virtually all of the parameters and states of the driver can be viewed and analyzed. In this particular example, there are four different items currently being tracked by Driver Monitor: bufferSwapCount, clientGLWaitTime, hardwareSubmitWaitTime and hardwareWaitTime. The first is relatively simple: bufferSwapCount is the total number of buffer swaps performed by the driver. The second, clientGLWaitTime, is the amount of time the CPU is stalled by the client OpenGL driver while waiting for a hardware time stamp to arrive. This usually occurs while waiting for a texture update or the completion of a fence command (from the GL_APPLE_fence extension). The third parameter, hardwareSubmitWaitTime, shows how long the CPU is stalled waiting to be able to submit a new batch of OpenGL commands. This is a particularly important parameter, as it can offer some insight into how much time the CPU wastes waiting for the GPU to process the previously submitted command buffers. The last parameter, hardwareWaitTime, is a global indicator of how long the CPU is stalled while waiting for the GPU. It encompasses hardwareSubmitWaitTime as well as other situations in which the CPU waits on the hardware.

The CHUD Tools are a suite of tools designed to assist developers in optimizing their code on Mac OS X. With regards to OpenGL applications, Shark is perhaps the most useful of these tools. Shark is a performance analysis tool that can help developers determine the location of performance problems at the code level. For more information on Shark, please reference the Using Shark documentation.

Finding and eliminating duplicate function calls and redundant state changes

One of the primary culprits for OpenGL performance issues is duplicate function calls. This problem takes many forms, including redundant state settings and multiple flushes or swaps in a single frame. For instance, calls such as glEnable(GL_LIGHTING) or glEnable(GL_TEXTURE_2D) only need to be issued once to enable the state and once to disable it. A common scenario is for an application to enable texturing and/or lighting every time through the drawing loop. Generally speaking, state changes that hold for the life of the application should be made once, in a dedicated setup routine. There are, however, instances where texturing or lighting may need to be turned off and back on again (such as when drawing a wire-frame outline around a textured polygon) in order to accomplish a specific visual effect or drawing operation. In such cases, isolated routines should change state only when necessary, and the check for whether a change is needed must be made at the application level rather than left to OpenGL itself.

It is important to understand that OpenGL does not perform any type of consistency checks or redundant state set checks. For instance, as in the example above, if a call is made such as glEnable(GL_LIGHTING) and subsequently, this same call is issued, OpenGL will not verify that the state is actually changing. It will simply update the state value of the supplied parameter, even if that value is identical to its current value. This is a design decision in the OpenGL specification and not implementation-specific. The additional code required to perform these checks, while useful for developers, would inevitably cause performance problems even for applications that were not doing such things.

State changes in OpenGL tend to be expensive and should be broken out into separate initialization or configuration routines. Placing these calls in draw loops, or in functions executed by the drawing loops, slows down OpenGL performance through unnecessary changes in state. Because OpenGL is performance-minded, no conditional or error checking is performed on incoming state changes, so these calls cost just as many cycles for redundant entries as they would for genuinely new data.
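The following sketch illustrates both points: state that holds for the life of the application is set once in a setup routine, and a small application-level guard skips an enable or disable that would not change anything. The routine names (setupGLState, setLightingEnabled, drawFrame) are placeholders, not part of any sample.

    #include <OpenGL/gl.h>
    #include <stdbool.h>

    /* Application-side shadow of one piece of GL state. OpenGL will not
       skip a redundant glEnable()/glDisable(), so the check has to be
       made here, at the application level. */
    static bool lightingEnabled = false;

    static void setLightingEnabled(bool enable)
    {
        if (enable == lightingEnabled)
            return;                      /* no change, no GL call made */
        if (enable)
            glEnable(GL_LIGHTING);
        else
            glDisable(GL_LIGHTING);
        lightingEnabled = enable;
    }

    /* One-time setup: state that never changes stays out of the draw loop. */
    static void setupGLState(void)
    {
        glEnable(GL_DEPTH_TEST);
        glEnable(GL_TEXTURE_2D);
        setLightingEnabled(true);
    }

    /* Per-frame work contains drawing only; state is touched solely when
       a specific effect (here, an unlit wire-frame outline) requires it. */
    static void drawFrame(void)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        /* ... draw textured, lit geometry ... */

        setLightingEnabled(false);
        /* ... draw the wire-frame outline ... */
        setLightingEnabled(true);
    }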

As an example, the NSGLWindow sample was modified to include redundant glFlush() calls in one of the drawing routines (the drawCube() function had a glFlush() added directly after each glEnd() call, for a total of 2 extra calls added to the code). The modified executable was then run through OpenGL Profiler, where a function trace and statistics were collected. The results are shown below.
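Before looking at those results, here is roughly what the modification amounts to. The sample's actual drawCube() source is not reproduced in this note, so the geometry calls below are stand-ins:

    #include <OpenGL/gl.h>

    static void drawCube(void)
    {
        glBegin(GL_QUADS);
        /* ... vertices for the cube faces ... */
        glEnd();
        glFlush();    /* redundant flush added for this experiment */

        glBegin(GL_LINES);
        /* ... vertices for the cube edges ... */
        glEnd();
        glFlush();    /* redundant flush added for this experiment */
    }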

At first glance, the initial reaction could be that the application's drawing loop is being overdriven (large blocks of time spent in CGLFlushDrawable() are often an indication of such an event). Even so, glBegin() and glDeleteTextures() are both using more time in OpenGL than the command that was duplicated.

Notice the number of glFlush() calls in the trace. This command is being issued after every block of drawing code, obviously in a rendering loop. These are both redundant and expensive in terms of performance.

Now that the redundant glFlush() calls have been removed, there is a marked decrease in time spent in OpenGL, specifically in CGLFlushDrawable(), which shows over a 10% reduction in time spent. glFlush() dropped from 5% to less than 3%, and overall time spent in OpenGL dropped by a little under 2%. While these improvements aren't exactly earth-shattering, keep in mind that this is a relatively simple application that is far from pushing the limits of the graphics pipeline. In a more realistic setting, this could account for a significantly larger percentage of time.

It is interesting to note that, comparing the performance statistics in the last screenshot with the previous ones, glBegin() appears to have increased by 5%. The reason is that glBegin() now accounts for a larger share of the remaining time spent in OpenGL relative to the other function calls, not that it is consuming more time itself.

Effective use of glFlush() and glFinish()

These two commands are both used to do essentially the same thing: submit all queued OpenGL commands to the hardware for execution. The major difference between the two is that glFinish() blocks until all of those commands have been executed by the hardware, while glFlush() simply submits the queued commands and returns without waiting for them to complete. This fact alone makes it quite clear that glFinish() can cause much more serious problems than glFlush().

Problems centered around these two function calls are usually easy to track down. Incorrect use of these commands can cause stalls and slow downs, which inevitably result in poor application performance. This is usually displayed as stuttering, sluggish response and high levels of CPU utilization. A quick look through the statistics report from OpenGL Profiler should show where the problems lie, if glFlush() or glFinish() is to blame.

As one can imagine, glFlush() has a much less significant impact on performance than glFinish() does. In the quest for higher performance, glFinish() commands should be removed unless they are deemed absolutely necessary. glFlush() commands can be used as long as they are used efficiently. For instance, you could use glFlush() to force drawing updates at the end of a draw loop, but you would not want to do this right before a call to a buffer swapping command (such as aglSwapBuffers(), which contains an implicit glFlush() itself). For a more detailed description of these two commands, please reference the Q&A glFlush() vs glFinish(). That document offers a clear and decisive definition and discussion of glFlush() and glFinish() and their impact on performance.
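A sketch of that guideline, assuming an AGL double-buffered context (the single-buffered case is one place where an explicit glFlush() at the end of the frame genuinely belongs):

    #include <OpenGL/gl.h>
    #include <AGL/agl.h>

    /* Double-buffered drawable: aglSwapBuffers() performs an implicit
       flush, so no explicit glFlush() belongs at the end of the frame. */
    static void drawFrameDoubleBuffered(AGLContext ctx)
    {
        /* ... issue drawing commands ... */
        aglSwapBuffers(ctx);
    }

    /* Single-buffered drawable: there is no swap, so one glFlush() at
       the end of the frame pushes the queued commands to the hardware. */
    static void drawFrameSingleBuffered(void)
    {
        /* ... issue drawing commands ... */
        glFlush();
    }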

While the effect of glFlush() on performance was illustrated in the previous section, it is important to see the difference between glFlush() and glFinish() in this context. The following example shows what occurs when those same glFlush() commands are replaced with glFinish() commands.

Notice now that the most time-consuming function is glFinish(), even more so than CGLFlushDrawable(). Total time spent in OpenGL is now up to over 29% from the 24% recorded previously with the glFlush() commands. 61% of the time spent in OpenGL was expended in the glFinish() calls. If viewed with Driver Monitor, this data would most likely show a large percentage of time spent waiting on the hardware to finish executing all the commands submitted up to the point at which glFinish() was called.

Don't try to overdrive the graphics pipeline (with rendering timers)

Another common performance issue is an application's attempt to overdrive the drawing loops. This is normally done through the use of a timer that fires in rapid succession, calling the drawing loop each time it fires. Typically, the timer interval has been set to some exceptionally small value (such as 0.001 to yield 1000 executions per second). The effect of this is quite the opposite of what is often expected: CPU time is consumed at double or triple (sometimes much higher) the rate it normally would be, and application performance is severely degraded. In this situation, it's best to either allow the system to regulate drawing (using -setNeedsDisplay: in Cocoa, for instance) or to use a much larger timer interval. If a larger timer interval is used, start with a value such as 0.01 (to yield 100 calls per second) and then adjust the interval up or down from that point to find the target framerate.
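As a sketch of the 'reasonable interval' approach, the fragment below uses a Core Foundation run-loop timer; the same idea applies to an NSTimer in Cocoa. The drawFrame routine is a placeholder for the application's own drawing code.

    #include <CoreFoundation/CoreFoundation.h>

    static void drawFrame(void)
    {
        /* ... issue OpenGL drawing commands and swap or flush ... */
    }

    static void renderTimerFired(CFRunLoopTimerRef timer, void *info)
    {
        drawFrame();
    }

    static void installRenderTimer(void)
    {
        /* 0.01 seconds yields at most 100 calls per second; adjust up or
           down from here to reach the target framerate rather than
           starting at something like 0.001, which only burns CPU. */
        CFRunLoopTimerRef timer = CFRunLoopTimerCreate(
            kCFAllocatorDefault,
            CFAbsoluteTimeGetCurrent() + 0.01,   /* first fire time */
            0.01,                                /* repeat interval */
            0, 0,
            renderTimerFired,
            NULL);

        CFRunLoopAddTimer(CFRunLoopGetCurrent(), timer, kCFRunLoopCommonModes);
        CFRelease(timer);
    }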

For more detail on this subject as well as a short code example, please see the NSTimers and Rendering Loops document. The code listed there offers a clear illustration of the proper architecture of a rendering loop in Cocoa that is driven by an NSTimer with a reasonable fire interval.

Understanding VSYNC

Applications are usually synchronized with the vertical refresh (VBL, vertical blank or vsync) in order to eliminate the problem of frame tearing. Frame tearing is a situation where part of a following frame overwrites previous frame data in the frame buffer before that frame has been fully rendered on the screen. The visual effect is that one sees perhaps half (more or less depending on the situation) of the new frame and the remainder of the previous frame. Synchronizing to the vertical refresh eliminates this problem by presenting a new frame only during the vertical retrace (when the electron gun is returning to its start point). This guarantees that only one frame is drawn per screen refresh.

There are some caveats to doing this, however. With synchronization enabled, the effective frame rate is limited to integer divisors of the monitor's current refresh rate (60Hz, 30Hz, 20Hz, 15Hz, and so on). This is because OpenGL will block while waiting for the next vertical retrace, which tends to waste time that could be spent performing other drawing operations.
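Synchronization with the vertical retrace is controlled per rendering context. A minimal sketch using CGL follows; AGL (AGL_SWAP_INTERVAL) and NSOpenGLContext (NSOpenGLCPSwapInterval) expose the same control.

    #include <OpenGL/OpenGL.h>
    #include <OpenGL/gl.h>

    /* Pass 1 to synchronize buffer swaps with the vertical retrace for
       the given context, or 0 to let swaps happen as fast as possible. */
    static CGLError setSwapInterval(CGLContextObj ctx, GLint interval)
    {
        return CGLSetParameter(ctx, kCGLCPSwapInterval, &interval);
    }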

Note: LCD screens do not have a "vertical retrace" in the conventional sense; they are commonly treated as having a fixed refresh rate of 60Hz.

Reading pixels from the frame buffer

glReadPixels() is by its very nature an expensive function call, so care must be taken to use it in the most effective and efficient way possible. A glReadPixels() call places a synchronization point in the command stream. This synchronization point forces the CPU and GPU to synchronize, which can stall the rendering pipeline. When this occurs, performance is guaranteed to suffer while one processor waits for the other to catch up.

As an alternative to glReadPixels(), you can also use asynchronous texture fetching. Essentially, asynchronous texture fetching uses the same pipeline as a texture upload with GL_APPLE_client_storage and GL_APPLE_texture_range, but reverses the direction of the operation so that it performs a download instead of an upload. This DMAs texture data from VRAM into an AGP-resident texture, which can then be accessed directly by the application. Using the shared texture storage hint (GL_STORAGE_SHARED_APPLE) eliminates the driver copy of the texture, thereby increasing throughput. To initiate the data transfer, a call is made to glCopyTexSubImage2D() followed immediately by a glFlush(). This puts the pixel data into the AGP texture; a subsequent call to glGetTexImage() transfers that texture into system memory. It is best to wait until the last possible moment to execute the transfer from AGP memory to system memory, as that allows the most time in between for additional processing.
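The following sketch pulls the steps above together. It assumes a current context on hardware that supports GL_APPLE_client_storage and GL_APPLE_texture_range; the rectangle texture target, the GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV format and the routine names are illustrative choices, not requirements.

    #include <OpenGL/gl.h>
    #include <OpenGL/glext.h>
    #include <stdlib.h>

    static GLuint  readbackTexture;
    static GLubyte *readbackBuffer;   /* application-owned backing store */

    static void setupAsyncReadback(GLsizei width, GLsizei height)
    {
        readbackBuffer = (GLubyte *)malloc(width * height * 4);

        glGenTextures(1, &readbackTexture);
        glBindTexture(GL_TEXTURE_RECTANGLE_EXT, readbackTexture);

        /* Let the texture live in application memory (no driver copy). */
        glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
        glTextureRangeAPPLE(GL_TEXTURE_RECTANGLE_EXT,
                            width * height * 4, readbackBuffer);
        glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,
                        GL_TEXTURE_STORAGE_HINT_APPLE,
                        GL_STORAGE_SHARED_APPLE);

        glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA, width, height,
                     0, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV,
                     readbackBuffer);
    }

    static void beginAsyncReadback(GLsizei width, GLsizei height)
    {
        /* Queue the copy from the frame buffer into the texture and
           flush so the transfer starts; do not wait for it here. */
        glBindTexture(GL_TEXTURE_RECTANGLE_EXT, readbackTexture);
        glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, 0, 0, 0, 0,
                            width, height);
        glFlush();
    }

    static void finishAsyncReadback(void)
    {
        /* Call this as late as possible; it pulls the texture contents
           into readbackBuffer in system memory. */
        glBindTexture(GL_TEXTURE_RECTANGLE_EXT, readbackTexture);
        glGetTexImage(GL_TEXTURE_RECTANGLE_EXT, 0,
                      GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV,
                      readbackBuffer);
    }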

Concluding Remarks

In conclusion, the above information should offer a solid foundation upon which to build a fast, optimized OpenGL application on Mac OS X. The important thing to remember is that this is really the "tip of the iceberg", so to speak; there are numerous other methods and techniques that can be employed to further enhance application performance. Also keep in mind that not all applications drive the graphics pipeline in the same manner, so different optimization techniques may be necessary depending on how the application is architected and how it handles rendering.

Reference Section

OpenGL Driver Monitor

OpenGL Driver Monitor

OpenGL Driver Monitor Decoder Ring

OpenGL Profiler

CHUD Tools

Using Shark

glFlush() vs glFinish()

NSTimers and Rendering Loops

Also, the OpenGL presentations from previous WWDC sessions are extremely valuable references for OpenGL performance. These are available on DVD to all developers who attend the conference.

Document Revision History

Date          Notes
2004-12-01    First Version

Posted: 2004-12-01